In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy import optimize
from scipy import spatial
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
sns.set(rc={"figure.figsize": (15, 6)})
sns.set_palette(sns.color_palette("Set2", 10))
In [2]:
lalonde_data = pd.read_csv('lalonde.csv')
The problem we try to solve in question 1 is evaluating the average causal effect of the "treatment", i.e. the job training program.
A naive analysis would simply compare the difference in means between the two groups (with and without treatment). Such a comparison reflects not only the average causal effect (ACE) but also the selection bias (SB). The latter can drastically shift the two averages we are comparing and lead to a wrong conclusion.
In order to minimize the role of the selection bias, we use the propensity score matching (PSM) technique. The idea is to compare the means only between subsets of the two groups that are similar.
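To make the issue concrete, the naive comparison can be written with standard potential-outcome notation (this notation is ours, added for clarity): writing $Y(1)$ and $Y(0)$ for the earnings with and without the program and $D$ for the treatment indicator,
$$\underbrace{E[Y \mid D=1] - E[Y \mid D=0]}_{\text{naive difference in means}} = \underbrace{E[Y(1) - Y(0) \mid D=1]}_{\text{causal effect on the treated}} + \underbrace{E[Y(0) \mid D=1] - E[Y(0) \mid D=0]}_{\text{selection bias}}$$
The second term vanishes only if the two groups are comparable on the characteristics that drive earnings, which is exactly what the matching below tries to achieve.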
In [3]:
#Function that plots a boxplot for re78
def compare_groups(data):
    plt.figure(figsize=(10, 10))
    sns.boxplot(x='treat', y='re78', data=data, showfliers=False, showmeans=True, meanline=True, meanprops=dict(color='r'))
    plt.xticks(range(2), ["Control Group", "Treatment Group"])
    plt.show()
In [4]:
compare_groups(lalonde_data)
In [5]:
#We keep track of the ratio (treatment group real earnings in 1978 mean) / (control group real earnings in 1978 mean) after each improvement done in exercise 1
means_ratio_over_improvement = []

#A function that prints the mean of real earnings in 1978 in both groups
def print_means(data):
    data_means = data.groupby("treat").agg(np.mean)
    print("Control group real earnings in 1978 mean: {:.0f}".format(data_means["re78"].loc[0]))
    print("Treatment group real earnings in 1978 mean: {:.0f}".format(data_means["re78"].loc[1]))
    ratio = data_means["re78"].loc[1] / data_means["re78"].loc[0]
    means_ratio_over_improvement.append(ratio)
    print("Ratio (treatment/control): {:.2f}".format(ratio))
In [6]:
print_means(lalonde_data)
A naive analysis would claim that there is no clear difference between the two groups and thus conclude that the "Job Training Program" (JTP) is useless. If anything, people in the treatment group earn about 10% less on average, which would make the treatment look worse than no treatment at all.
In [7]:
#Features of each group
main_variables = ['black', 'hispan', 'age', 'married', 'nodegree', 'educ']

#Function that displays a bar plot comparing the two groups for each feature
def display_proportions(data, variables=main_variables, n_cols=3):
    N = len(variables)
    f, axes = plt.subplots(nrows=int(np.ceil(N / n_cols)), ncols=n_cols)
    f.set_figheight(10)
    for idx, axis, var in zip(range(N), axes.flatten(), variables):
        sns.barplot(x='treat', y=var, data=data, ax=axis)
        axis.set_xticklabels(["Control Group", "Treatment Group"])
        axis.set_xlabel("")
        axis.set_title(idx + 1)
        axis.set_ylabel("mean of {}".format(var))
In [8]:
display_proportions(lalonde_data)
1: As we can see in the barplots above, the concentration of black people in the treatment group is about 4 times as high as in the control group
2: The concentration of hispanic people in the control group is more than twice as high as in the treatment group
3: The treatment group is on average 2 years younger than the control group
4: People in the control group are more than twice as likely to be married as the ones in the treatment group
5: The proportion of people without a degree is roughly 20% higher in the treatment group than in the control group
6: The mean and the variance of the number of years of education are more or less the same in both groups
With these 6 observations, we can say that the two groups are far from balanced and that, for this reason, it is dangerous to draw a conclusion from a superficial analysis.
Let's see whether each group has a similar number of samples:
In [9]:
lalond_count = lalonde_data.groupby("treat").agg("count")
print("Number of people in the control group: {}".format(lalond_count["re78"].loc[0]))
print("Number of people in the treatment group: {}".format(lalond_count["re78"].loc[1]))
As we can see, there are about 2.3 times as many samples in the control group. Because of this, we can be picky and select only the part of the control group that corresponds best to the treatment group. To do so, we will match samples from the two groups according to their propensity scores and then only keep and compare the matched samples.
Let's calculate the propensity scores.
In [10]:
from sklearn.linear_model import LogisticRegression
In [11]:
lr = LogisticRegression()
#Select features, that is drop id and treat columns
selectedFeatures = lalonde_data.drop(['id','treat'], axis=1)
#Fit the model
lr.fit(selectedFeatures, lalonde_data['treat']);
In [12]:
#Calculate the propensity scores
propensity_scores = lr.predict_proba(selectedFeatures)
In [13]:
#Only keep the probability of receiving the treatment and store it inside the dataframe
lalonde_data['propensity score'] = [x[1] for x in propensity_scores]
In [14]:
#One dataframe per group
control_group = lalonde_data[lalonde_data['treat'] == 0]
treatment_group = lalonde_data[lalonde_data['treat'] == 1]
In [15]:
#Compute the distance matrix using the absolute difference of the propensity scores
cost_matrix = spatial.distance.cdist(
    treatment_group["propensity score"].values.reshape((treatment_group.shape[0], 1)),
    control_group["propensity score"].values.reshape((control_group.shape[0], 1)),
    metric=lambda a, b: np.abs(a - b)
)
In [16]:
#Solve the assignment problem that minimizes the total cost, where the total cost is the sum of the matched distances,
#and get the indices of the pairs that achieve this minimum
treatment_ind, control_ind = optimize.linear_sum_assignment(cost_matrix)
In [17]:
#We construct a dataframe with the rows corresponding to the indices obtained above. Note that we have the same number of samples in each group by construction
lalonde_ps_matched = pd.concat((treatment_group.iloc[treatment_ind], control_group.iloc[control_ind]))
Now, let's compare the distribution of each feature in the two groups, as done earlier in part 1.2.
In [18]:
display_proportions(lalonde_ps_matched)
1: The difference in the concentration of black people shrank; however, the treatment group's rate is still almost twice that of the control group (better than before)
2: The concentration of hispanic people in the control group is now twice as high as in the treatment group (better than before)
3: The control group is on average 2 years younger than the treatment group (same gap as before, but reversed)
4: People in the control group now have almost the same probability of being married as the ones in the treatment group (better than before)
5: The proportion of people without a degree is now only about 5% higher in the treatment group than in the control group (down from 20% before)
6: The mean and the variance of the number of years of education are again more or less the same in both groups
Compared to before the matching, the features are more balanced. The only features that are still clearly unbalanced are the ones carrying racial information.
In [19]:
compare_groups(lalonde_ps_matched)
In [20]:
print_means(lalonde_ps_matched)
We can now see that the mean in the treatment group is slightly higher than in the control group, whereas it was slightly lower before. The maximum, median and quartiles are also all higher than their counterparts in the control group. This is completely different information from what we had before, but let's improve it even more.
The main remaining difference between the two groups resides in the proportions of hispanic and black people.
For this reason, we will add the condition that two matched subjects must have the same value for the hispan feature. Doing the same for the black feature is not possible, because 156 of the 185 people in the treatment group are black, whereas only 87 of the 429 people in the control group are (these counts can be checked with the crosstab sketched below).
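As a quick sanity check of these counts, a small crosstab can be used; this is a minimal optional sketch (not part of the original notebook), assuming black is the usual 0/1 indicator column:
In [ ]:
#Optional check of the race counts quoted above (hypothetical helper cell)
pd.crosstab(lalonde_data['treat'], lalonde_data['black'], margins=True)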
In [21]:
additional_feature_matched = 'hispan'

#Compute the distance matrix where a value is 0 if the row and the column have the same hispan value, 1 otherwise
add_cost_matrix = spatial.distance.cdist(
    treatment_group[additional_feature_matched].values.reshape((treatment_group.shape[0], 1)),
    control_group[additional_feature_matched].values.reshape((control_group.shape[0], 1)),
    metric=lambda a, b: int(a != b)
)
In [22]:
#Solve the assignment problem on the combined cost matrix (the propensity score distances plus the hispan mismatch penalties)
#to minimize the total cost, i.e. the sum of the distances,
#and get the indices of the pairs that achieve this minimum
treatment_ind_2, control_ind_2 = optimize.linear_sum_assignment(cost_matrix + add_cost_matrix)
In [23]:
#We construct a dataframe with the rows corresponding to the indices obtained above. Note that we have the same number of samples in each group by construction
lalonde_ps_matched_2 = pd.concat((treatment_group.iloc[treatment_ind_2], control_group.iloc[control_ind_2]))
In [24]:
display_proportions(lalonde_ps_matched_2)
The proportion of hispanic people in the two groups is now the same, and the only feature that remains unbalanced is the proportion of black people.
In [25]:
compare_groups(lalonde_ps_matched_2)
In [26]:
print_means(lalonde_ps_matched_2)
The difference in earnings we observed in part 1.4 increased, but only slightly.
Based on this difference, we could say that the "Job Training Program" (JTP) is slightly useful and has, on average, a positive effect on the earnings of the people who took the program. We still have a selection bias, since there are far more black people in the treatment group, and hence any conclusion drawn from these data remains biased. Shrinking the samples taken from each group so that we only match hispanic with hispanic and black with black would result in such a small set that it would not be possible to draw any conclusion.
However, it is worth pointing out how far we are from the naive analysis of part 1. There, the mean of the treatment group's real earnings in 1978 was about 10% lower than that of the control group. After refining the analysis with propensity score matching, and then additionally matching hispanic people only with hispanic people, the mean of the treatment group's real earnings in 1978 is about 10% higher than that of the control group. This example shows how a naive analysis can give a misleading result: we go from "the treatment is harmful" to "the treatment is worthwhile".
Below you can find a barplot summary of the ratios of the means.
In [27]:
#Plot the means we recorded after each improvement
sns.barplot(y=means_ratio_over_improvement, x = ["Naive", "Propensity score", "Propensity score + hispan matching"])
In [28]:
from sklearn import metrics
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from time import time
In [29]:
#Loading data
all_news = fetch_20newsgroups(subset='all')
vectorizer = TfidfVectorizer(stop_words='english', max_df=0.5, sublinear_tf=True)
In [30]:
#Vectorizing
news_data = vectorizer.fit_transform(all_news.data)
news_target = all_news.target
news_target_names = all_news.target_names
feature_names = vectorizer.get_feature_names()
In [31]:
# this could have been done in a simpler way for this homework,
# but it might be useful to have such a powerful function for other uses,
# hence we decided to keep it here so that others could use it too :)
def split(X, y, ratios):
    """
    Split X and y given some ratios

    Parameters
    ----------
    X : ndarray
        data matrix
    y : ndarray
        labels
    ratios : list(float)
        ratios on how to split X and y (their sum must be strictly less than 1;
        the remainder forms the last split)

    Returns
    -------
    out : tuple(ndarray)
        Output one tuple of first, the splits of X and then, the splits of y
    """
    assert np.sum(ratios) < 1, "sum of ratios must be strictly less than 1"
    assert len(ratios) >= 1, "at least one ratio required to split"

    def inner_split(X, y, ratios, acc_X, acc_y):
        ratio, *ratios_remaining = ratios
        X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=ratio)
        if len(ratios_remaining) == 0:
            acc_X.extend([X_train, X_test])
            acc_y.extend([y_train, y_test])
            acc_X.extend(acc_y)
            return tuple(acc_X)
        else:
            acc_X.append(X_train)
            acc_y.append(y_train)
            return inner_split(X_test, y_test, [r / (1.0 - ratio) for r in ratios_remaining], acc_X, acc_y)

    return inner_split(X, y, ratios, [], [])
In [32]:
def predict(clf, X_train, y_train, X_test):
    """
    Fit a classifier on the training data and labels,
    then predict labels for the test data.
    It also times the different steps.

    Parameters
    ----------
    clf : sklearn classifier
        classifier
    X_train : ndarray
        training data
    y_train : ndarray
        training labels
    X_test : ndarray
        testing data

    Returns
    -------
    out : ndarray
        Output the predicted labels
    """
    start_time = time()
    print("Prediction computations started...")
    clf.fit(X_train, y_train)
    train_time = time() - start_time
    pred = clf.predict(X_test)
    prediction_time = time() - train_time - start_time
    print("...Finished")
    print("Training time = {}s".format(round(train_time)))
    print("Prediction time = {}s".format(round(prediction_time)))
    return pred
In [33]:
def report(results, n_top=3, compared_to=10):
    """
    Print the parameters of the best grid search cross-validation results
    and plot their accuracy compared to another accuracy score.

    Parameters
    ----------
    results : sklearn grid search cv_results_
        grid search cross-validation results
    n_top : int
        the number of best results to plot
    compared_to : int
        the rank of the result to compare the best results with

    Returns
    -------
    out : None
        Output some prints and a plot
    """
    means = []
    stds = []
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            mean = results['mean_test_score'][candidate]
            std = results['std_test_score'][candidate]
            means.append(mean)
            stds.append(std)
            print("Model with rank: {}".format(i))
            print("Mean validation score: {0:.4f} (std: {1:.4f})".format(mean, std))
            print("Parameters: {}".format(results['params'][candidate]))
    min_ = np.min(results['mean_test_score'][results['rank_test_score'] == compared_to])
    print('\n{0}\'th score = {1:.4f}'.format(compared_to, min_))
    means = np.array(means) - min_
    plt.title("Top {0} best scores (compared to the {1}'th score = {2:.3f})".format(n_top, compared_to, min_))
    plt.bar(range(n_top), means, yerr=stds, align="center")
    plt.xticks(range(n_top), range(1, n_top + 1))
    plt.xlabel("n'th best scores")
    plt.ylabel("score - {}'th score".format(compared_to))
    plt.show()
In [34]:
ratios = [0.8, 0.1] #Ratios: 0.8 for train, then 0.1 each for test and validation
X_train, X_test, X_validation, \
y_train, y_test, y_validation = split(news_data, news_target, ratios)
In [35]:
# use a full grid over max_depth and n_estimators parameters
param_grid = {
    "max_depth": [3, 10, 20, None],
    "n_estimators": np.linspace(3, 200, num=5, dtype=int),
    #"max_features": [1, 3, 10],
    #"min_samples_split": [2, 3, 10],
    #"min_samples_leaf": [1, 3, 10],
    #"bootstrap": [True, False],
    #"criterion": ["gini", "entropy"]
}
# run grid search
grid_search = GridSearchCV(RandomForestClassifier(), param_grid=param_grid)
grid_search.fit(X_validation, y_validation)
None #No output cell
After having evaluated the model for many different parameter combinations, we choose the best parameters by comparing their mean validation scores and standard deviations.
In [36]:
report(grid_search.cv_results_, n_top=5, compared_to=10)
Let's save the parameters which give the best result inside a variable
In [37]:
rank_chosen = 1 #Position of the parameters we choose
best_params = grid_search.cv_results_['params'][np.flatnonzero(grid_search.cv_results_['rank_test_score'] == rank_chosen)[0]]
We reuse the optimal parameters computed above to produce predictions with a random forest classifier.
In [38]:
random_forest_clf = RandomForestClassifier(**best_params)
pred = predict(random_forest_clf, X_train, y_train, X_test)
In [39]:
#Choose the average type
average_type = "weighted"
#Get the different scores of the prediction computed above
accuracy = metrics.accuracy_score(y_test, pred)
precision = metrics.precision_score(y_test, pred, average=average_type)
recall = metrics.recall_score(y_test, pred, average=average_type)
f1_score = metrics.f1_score(y_test, pred, average=average_type)
print("accuracy = {:.4f}".format(accuracy))
print("precision = {:.4f}".format(precision))
print("recall = {:.4f}".format(recall))
print("f1_score = {:.4f}".format(f1_score))
As one can see, neither precision, recall nor f1_score adds much information beyond the accuracy. This is because there are quite a lot of classes (20) and they are roughly uniformly distributed:
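For completeness, a per-class breakdown can also be printed; the following is a minimal optional sketch (not part of the original notebook) reusing the y_test, pred and news_target_names variables defined above:
In [ ]:
#Optional: per-class precision/recall/f1, to check that no single class dominates the averages
print(metrics.classification_report(y_test, pred, target_names=news_target_names))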
In [40]:
classes = range(len(news_target_names))
def sum_by_class(arr):
    return np.array([np.sum(arr == i) for i in classes])
test_sum_by_class = sum_by_class(y_test)
val_sum_by_class = sum_by_class(y_validation)
train_sum_by_class = sum_by_class(y_train)
p1 = plt.bar(classes, test_sum_by_class)
p2 = plt.bar(classes, val_sum_by_class, bottom=test_sum_by_class)
p3 = plt.bar(classes, train_sum_by_class, bottom=test_sum_by_class + val_sum_by_class)
plt.xticks(classes, news_target_names, rotation='vertical')
plt.tick_params(axis='x', labelsize=15)
plt.legend((p1[0], p2[0], p3[0]), ('test', 'validation', 'train'))
plt.show()
The plot above shows that every class is well represented in the test, training and validation sets.
Let's show the confusion matrix
In [41]:
import itertools
# A function to plot the confusion matrix, taken from http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html#sphx-glr-auto-examples-model-selection-plot-confusion-matrix-py
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=90)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
In [42]:
cnf_matrix = metrics.confusion_matrix(y_test, pred)
# Plot non-normalized confusion matrix
plt.figure(figsize=(25, 15))
plot_confusion_matrix(cnf_matrix, classes=news_target_names, title='Confusion matrix, without normalization')
In [43]:
# Plot normalized confusion matrix
plt.figure(figsize=(25, 15))
plot_confusion_matrix(cnf_matrix, classes=news_target_names, normalize=True, title='Normalized confusion matrix')
What the confusion matrices show is that we did a pretty good job at assigning the categories, except that we classified quite a lot of documents as religion.christian instead of religion.misc, which is understandable because the two categories are closely related. Atheism is also closely related to religion, hence the above-average value for this category, though it remains small. The other place where we could have done better is the technology topics (pc.hardware, mac.hardware, etc.), which are again very closely related to one another. Overall, our classifier usually categorizes a news item correctly, and when it does not, it tends to pick a category closely related to the correct one.
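One way to back this reading of the matrices with numbers is to list the largest off-diagonal counts; this is a minimal optional sketch (not part of the original notebook) reusing cnf_matrix and news_target_names:
In [ ]:
#Optional: print the five most frequent misclassifications (true label -> predicted label)
off_diag = cnf_matrix.copy()
np.fill_diagonal(off_diag, 0)
worst_flat = np.argsort(off_diag, axis=None)[::-1][:5]
for true_idx, pred_idx in zip(*np.unravel_index(worst_flat, off_diag.shape)):
    print("{} -> {}: {}".format(news_target_names[true_idx], news_target_names[pred_idx], off_diag[true_idx, pred_idx]))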
Let's see what information the feature_importances_ attribute can provide us
In [44]:
importances = random_forest_clf.feature_importances_
std = np.std([tree.feature_importances_ for tree in random_forest_clf.estimators_], axis=0)
#Sort the feature by importance
indices = np.argsort(importances)[::-1]
print("Total number of features = {}".format(len(indices)))
In [45]:
# Only most important ones (out of thousands)
num_best = 20
best_indices = indices[:num_best]
best_importances = importances[best_indices]
best_std = std[best_indices]
# Plot the feature importances
plt.figure()
plt.title("20 best feature importances")
plt.bar(range(num_best), best_importances, yerr=best_std, align="center")
plt.xticks(range(num_best), np.array(feature_names)[best_indices], rotation='vertical')
plt.tick_params(axis='x', labelsize=15)
plt.xlim([-1, num_best])
plt.xlabel("Feature indices")
plt.ylabel("Feature names")
plt.show()
What we see is that the important features are the ones that can easily exclude a news item from certain categories, because there is an extremely small chance these words appear in a news item about those categories. For example, it is very unlikely that a news item about religion talks about a "sale", whereas it is very likely for a news item about technology.
The third feature, 'dod', might be surprising because it is not a word we hear or read often. However, if we look it up on the web, we find that it refers to the 'Department of Defense', which is a word contained almost only in news about politics. This reinforces our insight about the importance of these features.